The Information Sieve

Authors

  • Greg Ver Steeg
  • Aram Galstyan
Abstract

We introduce a new framework for unsupervised learning of representations based on a novel hierarchical decomposition of information. Intuitively, data is passed through a series of progressively fine-grained sieves. Each layer of the sieve recovers a single latent factor that is maximally informative about multivariate dependence in the data. The data is transformed after each pass so that the remaining unexplained information trickles down to the next layer. Ultimately, we are left with a set of latent factors explaining all the dependence in the original data and remainder information consisting of independent noise. We present a practical implementation of this framework for discrete variables and apply it to a variety of fundamental tasks in unsupervised learning, including independent component analysis, lossy and lossless compression, and predicting missing values in data.

Introduction

The hope of finding a succinct principle that elucidates the brain's information processing abilities has often kindled interest in information-theoretic ideas (Barlow, 1989; Simoncelli & Olshausen, 2001). In machine learning, on the other hand, the past decade has witnessed a shift in focus toward expressive, hierarchical models, with successes driven by increasingly effective ways to leverage labeled data to learn rich models (Schmidhuber, 2015; Bengio et al., 2013). Information-theoretic ideas like the venerable InfoMax principle (Linsker, 1988; Bell & Sejnowski, 1995) can be and are applied in both contexts with empirical success, but they do not allow us to quantify the information value of adding depth to our representations. We introduce a novel incremental and hierarchical decomposition of information and show that it defines a framework for unsupervised learning of deep representations in which the information contribution of each layer can be precisely quantified. Moreover, this scheme automatically determines the structure and depth among hidden units in the representation based only on local learning rules.

The shift in perspective that enables our information decomposition is to focus on how well the learned representation explains multivariate mutual information in the data (a measure originally introduced as "total correlation" (Watanabe, 1960)). Intuitively, our approach constructs a hierarchical representation of data by passing it through a sequence of progressively fine-grained sieves. At the first layer of the sieve we learn a factor that explains as much of the dependence in the data as possible. The data is then transformed into the "remainder information", from which this dependence has been extracted. The next layer of the sieve looks for the largest source of dependence in the remainder information, and the cycle repeats. At each step, we obtain successively tighter upper and lower bounds on the multivariate information in the data, with convergence between the bounds obtained when the remaining information consists of nothing but independent factors. Because we end up with independent factors, one can also view this decomposition as a new way to do independent component analysis (ICA) (Comon, 1994; Hyvärinen & Oja, 2000). Unlike traditional methods, we do not assume a specific generative model of the data (i.e., that it consists of a linear transformation of independent sources), and we extract independent factors incrementally rather than all at once.
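To make the layer-by-layer recursion concrete, here is a minimal Python sketch of the sieve's outer loop. The helpers `learn_factor` and `extract_remainder` are hypothetical placeholders for the optimization and remainder-construction steps that the paper develops in Secs. 2 and 3; this is a sketch of the control flow, not the authors' implementation.

```python
import numpy as np

def sieve(data, learn_factor, extract_remainder, max_layers=10):
    """Hypothetical sketch of the sieve's outer loop.

    data: (n_samples, n_variables) array of discrete values.
    learn_factor: returns a latent factor y (one value per sample)
        chosen to explain as much dependence in its input as possible.
    extract_remainder: returns "remainder information": the input
        transformed so the dependence captured by y is removed.
    """
    factors = []
    remainder = data
    for k in range(max_layers):
        y = learn_factor(remainder)        # layer k's latent factor
        if y is None:                      # no dependence left to explain
            break
        factors.append(y)
        remainder = extract_remainder(remainder, y)
    # factors explain the dependence; remainder is, ideally, independent noise
    return factors, remainder
```

Each pass leaves strictly less dependence in the remainder, which is what makes the layer-wise information accounting in the following sections possible.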
The implementation we develop here uses only discrete variables and is therefore most relevant for the challenging problem of ICA with discrete variables, which has applications to compression (Painsky et al., 2014). After providing some background in Sec. 1, we introduce a new way to iteratively decompose the information in data in Sec. 2, and show how to use these decompositions to define a practical and incremental framework for unsupervised representation learning in Sec. 3. We demonstrate the versatility of this framework by applying it first to independent component analysis (Sec. 4). Next, we use the sieve as a lossy compression to perform tasks typically relegated to generative models, including in-painting and generating new samples (Sec. 5). Finally, we cast the sieve as a lossless compression and show that it beats standard compression schemes on a benchmark task (Sec. 6).

1 Information-theoretic learning background

Using standard notation (Cover & Thomas, 2006), a capital $X_i$ denotes a random variable taking values in some domain, with instances denoted in lowercase, $x_i$. In this paper, the domains of all variables are considered to be discrete and finite. We abbreviate multivariate random variables as $X \equiv X_{1:n} \equiv X_1, \ldots, X_n$, with an associated probability distribution $p_X(X_1 = x_1, \ldots, X_n = x_n)$, which is typically abbreviated to $p(x)$. We will index different groups of multivariate random variables with superscripts, $X^k$, as defined in Fig. 1. We let $X^0$ denote the original observed variables, and we often omit the superscript in this case for readability. Entropy is defined in the usual way as $H(X) \equiv \mathbb{E}_X[\log 1/p(x)]$. We use base two logarithms so that the unit of information is bits. Higher-order entropies can be constructed in various ways from this standard definition. For instance, the mutual information between two groups of random variables, $X$ and $Y$, can be written as the reduction of uncertainty in one variable given information about the other, $I(X; Y) = H(X) - H(X|Y)$.

The "InfoMax" principle (Linsker, 1988; Bell & Sejnowski, 1995) suggests that for unsupervised learning we should construct $Y$'s that maximize their mutual information with $X$, the data. Despite its intuitive appeal, this approach has several potential problems (see (Ver Steeg et al., 2014) for one example). Here we focus on the fact that the InfoMax principle is not very useful for characterizing "deep representations", even though it is often invoked in this context (Vincent et al., 2008). This follows directly from the data processing inequality (a similar argument appears in (Tishby & Zaslavsky, 2015)). Namely, if we start with $X$, construct a layer of hidden units $Y^1$ that are a function of $X$, and continue adding layers to a stacked representation so that $X \rightarrow Y^1 \rightarrow Y^2 \rightarrow \cdots \rightarrow Y^k$, then the information that the $Y$'s have about $X$ cannot increase after the first layer, $I(X; Y^{1:k}) = I(X; Y^1)$. From the point of view of mutual information, $Y^1$ is a copy and $Y^2$ is just a copy of a copy. While a coarse-grained copy might be useful, the InfoMax principle does not quantify how or why.

Instead of looking for a $Y$ that memorizes the data, we shift our perspective to searching for a $Y$ such that the $X_i$'s are as independent as possible conditioned on this $Y$. Essentially, we are trying to reconstruct the latent factors that are the cause of the dependence in the $X_i$'s. To formalize this, we turn to the multivariate generalization of mutual information first introduced as "total correlation" (Watanabe, 1960):
$$TC(X) \equiv D_{KL}\!\left(p(x) \,\Big\|\, \prod_{i=1}^{n} p(x_i)\right) = \sum_{i=1}^{n} H(X_i) - H(X)$$
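As a sanity check on this definition, the short self-contained Python snippet below estimates $TC(X)$ for discrete samples using plug-in (empirical-frequency) entropy estimates and the sum-of-marginal-entropies identity above. It is only an illustration of the quantity being discussed, not the paper's code, and plug-in estimates are known to be biased for small samples.

```python
import numpy as np
from collections import Counter

def entropy(samples):
    # Plug-in entropy estimate (in bits) from empirical frequencies.
    # samples: (n_samples, d) array of discrete values; rows are outcomes.
    counts = Counter(map(tuple, samples))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -(p * np.log2(p)).sum()

def total_correlation(x):
    # TC(X) = sum_i H(X_i) - H(X_1, ..., X_n)
    marginal_sum = sum(entropy(x[:, [i]]) for i in range(x.shape[1]))
    return marginal_sum - entropy(x)

# Toy example: X2 is a copy of X1, X3 is independent noise,
# so the only dependence is the one shared bit and TC(X) ≈ 1 bit.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=10000)
x3 = rng.integers(0, 2, size=10000)
x = np.stack([x1, x1, x3], axis=1)
print(total_correlation(x))  # ~1.0
```

$TC(X)$ is zero exactly when the $X_i$'s are independent, which is why driving the total correlation of the remainder information to zero is a sensible stopping criterion for the sieve.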
